Hierarchical Text Categorization and Its Application to Bioinformatics
Abstract
In a hierarchical categorization problem, categories are partially ordered to form a hierarchy. In this dissertation, we explore two main aspects of hierarchical categorization: learning algorithms and performance evaluation. We introduce the notion of consistent hierarchical classification, which makes classification results more comprehensible and more easily interpretable for end users. Among the previously introduced hierarchical learning algorithms, only a local top-down approach produces consistent classification. The present work extends this algorithm to the general case of directed acyclic graph (DAG) class hierarchies and possible internal class assignments. In addition, we propose a new global hierarchical approach aimed at performing consistent classification. It is a general framework for converting a conventional “flat” learning algorithm into a hierarchical one. An extensive set of experiments on real and synthetic data indicates that the proposed approach significantly outperforms the corresponding “flat” method as well as the local top-down method. For evaluation purposes, we use a novel hierarchical evaluation measure that is superior to the existing hierarchical and non-hierarchical evaluation techniques according to a number of formal criteria. This dissertation also presents the first attempt to apply hierarchical text categorization techniques to bioinformatics tasks. Three bioinformatics problems are addressed. The objective of the first task, indexing biomedical articles with Medical Subject Headings (MeSH), is to associate documents with biomedical concepts from the specialized MeSH vocabulary. In the second application, we tackle the challenging problem of gene functional annotation from biomedical literature. Our experiments demonstrate a considerable advantage of hierarchical text categorization techniques over the “flat” method on these two tasks.
In the third application, our goal is to enrich the analysis of plain experimental data with biological knowledge. In particular, we incorporate functional information on genes directly into the clustering process of microarray data, which improves the biological relevance and value of the clustering results.
Similar resources
A New Co-similarity Measure : Application to Text Mining and Bioinformatics. (Une Nouvelle Mesure de Co-Similarité : Applications aux Données Textuelles et Génomique)
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and there exist a multitude of different clustering algorithms for different settings. As datasets become larger and more varied, adaptations of existing algorithms are required to maintain the quality of clus...
Data Mining Process Using Clustering : A Survey
Clustering is a basic and useful method for understanding and exploring a data set. Clustering is the division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. Interest in clustering has increased recently in new areas of applications including data mining, bioinformatics, web mining...
Experiments with HITEC: a Hierarchical Text Categorizer
This paper presents experiments on the effectiveness of the HITEC software (HIerarchical TExt Categorizer) on several natural languages (English, German) and with various kinds of text corpora. HITEC applies the UFEX (Universal Feature EXtractor) method for hierarchical text categorization. The obtained results show that HITEC outperforms its known competitors on the investigated corpora, its...
Hierarchical text categorization using fuzzy relational thesaurus
Text categorization is the task of assigning a text document to an appropriate category in a predefined set of categories. We present a new approach to text categorization by means of a Fuzzy Relational Thesaurus (FRT). FRT is a multilevel category system that stores and maintains an adaptive local dictionary for each category. The goal of our approach is twofold: to develop a reliable t...
String Matching and its Applications in Diversified Fields
String searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text.[11] String matching is a classical problem in computer science. In this paper we are trying to explore the various diversified fields where string matching has ...
Publication date: 2005